
[VL] Enable native Parquet write for complex types (Struct/Array/Map)#11788

Open
Zouxxyy wants to merge 5 commits into apache:main from Zouxxyy:dev/native-write

Conversation

@Zouxxyy
Contributor

@Zouxxyy Zouxxyy commented Mar 19, 2026

What changes are proposed in this pull request?

Enable native Parquet write for complex types (Struct/Array/Map) in Velox backend.

Velox's parquet writer converts vectors to Arrow then writes via Arrow's Parquet writer, which natively supports nested types. The previous Scala-side type restrictions were unnecessary.

Changes:

  • Remove supportNativeWrite gate — no longer needed since supportWriteFilesExec handles validation
  • Remove Struct/Array/Map restrictions from validateDataTypes for Parquet (only YearMonthIntervalType remains blocked, as Arrow has no mapping for it)
  • Make validateDataTypes recursively check nested types for YearMonthIntervalType
  • Add tests for struct, array, map, and nested struct writes
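The recursive nested-type check described above can be sketched as a minimal stand-alone example. This is a hedged illustration, not Gluten's actual code: the toy `DataType` hierarchy below mirrors Spark's type names but is defined locally so the snippet is self-contained.

```scala
// Minimal stand-in type hierarchy (assumption: mirrors Spark's type names,
// not the real org.apache.spark.sql.types classes).
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case object YearMonthIntervalType extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class MapType(keyType: DataType, valueType: DataType) extends DataType
final case class StructType(fieldTypes: Seq[DataType]) extends DataType

object NestedTypeCheck {
  // True when no YearMonthIntervalType occurs anywhere in the type tree,
  // i.e. the recursive walk validateDataTypes is described as performing.
  def isWritable(dt: DataType): Boolean = dt match {
    case YearMonthIntervalType => false
    case ArrayType(e)          => isWritable(e)
    case MapType(k, v)         => isWritable(k) && isWritable(v)
    case StructType(fs)        => fs.forall(isWritable)
    case _                     => true // all other primitives pass
  }
}
```

A struct containing only primitives or nested containers passes, while a `YearMonthIntervalType` buried at any depth fails the whole column.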

Note:

  • Exclude SPARK-47546 variant binary tests (Spark 4.0/4.1). Velox parquet writer writes all struct fields as OPTIONAL, but variant type requires REQUIRED. This is a Velox-side limitation (exportToArrow unconditionally sets nullable flag). Fix needs Velox changes, out of scope for this PR.

How was this patch tested?

New tests in VeloxParquetWriteSuite. Existing tests pass.

Was this patch authored or co-authored using generative AI tooling?

Cooperated with: Kiro (Claude Opus 4.6)

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Mar 19, 2026
@github-actions

Run Gluten Clickhouse CI on x86

@Zouxxyy
Contributor Author

Zouxxyy commented Mar 19, 2026

Write Validation Chain (Spark 3.4+). Generated by: Kiro (Claude Opus 4.6)

 +-------------------------------------------------------------+
 |              Spark SQL Write Statement                        |
 |  (INSERT INTO / CTAS / INSERT OVERWRITE DIR / Hive INSERT)   |
 +-----------------------------+-------------------------------+
                               |
                               v
 +-----------------------------+-------------------------------+
 |  DataWritingCommandExec(cmd, child)                         |
 |    child contains WriteFilesExec (Spark 3.4+)               |
 +-----------------------------+-------------------------------+
                               |
 ==============================|====================================
  Gate 1: OffloadSingleNodeRules (standard columnar offload)
 ==============================|====================================
                               |
                               v
              +--------------------------------------+
              |  WriteFilesExec matched              |
              |  -> WriteFilesExecTransformer        |
              |  -> ColumnarWriteFilesExec           |
              +------+-------------------------------+
                     |
 ====================|=============================================
  Gate 2: doValidateInternal  (WriteFilesExecTransformer.scala)
 ====================|=============================================
                     |
                     v
              +--------------------------------------+
              |  2a. Constant complex type check     |
              |                                      |
              |  Project contains                    |
              |  Literal(ArrayType|MapType)?          |
              +------+----------------+--------------+
                 YES |                | NO
                     v                v
               +--------+   +-------------------------------+
               |FALLBACK|   | 2b. supportWriteFilesExec()   |
               +--------+   +-------------------------------+
                                     |
                                     v
                    +---------------------------------------------------+
                    |   Validation chain (in order, any fail = fallback) |
                    |                                                   |
                    |  (1) validateFileFormat                            |
                    |      OK ParquetFileFormat                         |
                    |      OK HiveFileFormat (Parquet SerDe)            |
                    |      X  Other formats                             |
                    |                    |                               |
                    |                    v                               |
                    |  (2) validateCompressionCodec                      |
                    |      X  brotli / lzo / lz4raw / lz4_raw           |
                    |                    |                               |
                    |                    v                               |
                    |  (3) validateFieldMetadata                         |
                    |      X  StructField.metadata != Metadata.empty    |
                    |                    |                               |
                    |                    v                               |
                    |  (4) validateDataTypes (recursive)                 |
                    |      X  YearMonthIntervalType                     |
                    |         (including nested in Struct/Array/Map)     |
                    |      OK StructType / ArrayType / MapType          |
                    |      OK all other primitive types                  |
                    |                    |                               |
                    |                    v                               |
                    |  (5) validateWriteFilesOptions                     |
                    |      X  maxRecordsPerFile > 0                     |
                    |                    |                               |
                    |                    v                               |
                    |  (6) validateBucketSpec                            |
                    |      X  Non Hive-compatible bucket write          |
                    +------------------------+--------------------------+
                                             |
                                      +------+------+
                                      | All passed? |
                                      +--+-------+--+
                                      NO |       | YES
                                         v       v
                                   +--------+   +------------------------------+
                                   |FALLBACK|   | 2c. doNativeValidation()     |
                                   +--------+   |  Substrait -> C++ validation |
                                                +---------------+--------------+
                                                                |
 ===============================================================|=========
  Gate 3: C++ Validator  (SubstraitToVeloxPlanValidator.cc)
 ===============================================================|=========
                                                                |
                                                                v
                              +--------------------------------------+
                              |  validate(WriteRel)                  |
                              |                                      |
                              |  3a. Recursively validate input plan |
                              |  3b. Parse input row type            |
                              |  3c. Validate partition column types:|
                              |      OK BOOLEAN / TINYINT / SMALLINT |
                              |      OK INTEGER / BIGINT             |
                              |      OK VARCHAR / VARBINARY          |
                              |      X  Other types as partition col |
                              |                                      |
                              |  Data columns: no type restriction   |
                              +------+----------------+--------------+
                                  NO |                | YES
                                     v                v
                               +--------+   +---------------------+
                               |FALLBACK|   | OK: Native Write    |
                               +--------+   +---------+-----------+
                                                      |
                                                      v
                              +--------------------------------------+
                              |           Execution Layer             |
                              |                                      |
                              |  VeloxColumnarWriteFilesExec          |
                              |       |                              |
                              |       v                              |
                              |  VeloxParquetDataSource               |
                              |       |                              |
                              |       v                              |
                              |  velox::parquet::Writer              |
                              |  (backed by Arrow Parquet Writer)    |
                              +--------------------------------------+
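The ordered, short-circuiting behavior of Gate 2b in the diagram above can be sketched as follows. The check names mirror the diagram; the bodies are placeholders, not Gluten's real implementations.

```scala
// Hedged sketch of a fail-fast validation chain: run named checks in
// order and stop at the first failure, which triggers fallback.
final case class ValidationResult(ok: Boolean, reason: Option[String] = None)

object WriteFilesValidationSketch {
  def runChain(checks: Seq[(String, () => Boolean)]): ValidationResult =
    checks.collectFirst {
      // collectFirst stops iterating at the first failing check,
      // so later checks are never evaluated.
      case (name, check) if !check() =>
        ValidationResult(ok = false, reason = Some(s"failed: $name"))
    }.getOrElse(ValidationResult(ok = true))
}
```

For example, a chain of `("validateFileFormat", ...)`, `("validateCompressionCodec", ...)`, and so on in the order shown above would report the first failing check's name as the fallback reason.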

@jackylee-ch jackylee-ch requested a review from Copilot March 19, 2026 04:02
Contributor

Copilot AI left a comment


Pull request overview

This PR enables Velox’s native Parquet write path to support complex Spark SQL types (Struct/Array/Map) by removing earlier type-gating and adjusting the Velox write validation to allow nested types for Parquet.

Changes:

  • Removes the schema-based “native write supported” gate (supportNativeWrite) and relies on the WriteFiles validation path instead.
  • Updates Velox write validation to allow Parquet StructType (still blocks YearMonthIntervalType) and reorders validation checks.
  • Simplifies Delta Parquet native-writability checks and adds new Velox Parquet write tests for complex/nested types.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • gluten-substrait/src/main/scala/org/apache/spark/sql/execution/datasources/GlutenWriterColumnarRules.scala: Removes the schema gate before enabling native write properties/adaptor injection.
  • gluten-substrait/src/main/scala/org/apache/gluten/backendsapi/BackendSettingsApi.scala: Deletes supportNativeWrite from the backend settings API.
  • backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala: Allows Parquet struct types; refactors and reorders the native write validation chain.
  • backends-velox/src/test/scala/org/apache/spark/sql/execution/VeloxParquetWriteSuite.scala: Adds native Parquet write coverage for struct/array/map and nested struct.
  • backends-velox/src-delta33/main/scala/org/apache/spark/sql/delta/files/GlutenDeltaFileFormatWriter.scala: Removes the dependency on the deleted Parquet companion helper; forces the native-writable flag.
  • backends-velox/src-delta33/main/scala/org/apache/spark/sql/delta/GlutenParquetFileFormat.scala: Removes the fallback branch and companion object; always uses the Gluten Parquet OutputWriterFactory.


Comment on lines 101 to 106
 case rc @ DataWritingCommandExec(cmd, child) =>
   // The same thread can set these properties in the last query submission.
   val format =
-    if (
-      BackendsApiManager.getSettings.supportNativeWrite(child.schema.fields) &&
-      BackendsApiManager.getSettings.enableNativeWriteFiles()
-    ) {
+    if (BackendsApiManager.getSettings.enableNativeWriteFiles()) {
       getNativeFormat(cmd)
     } else {
Contributor Author


Not a real issue. On Spark 3.4+, NativeWritePostRule only injects FakeRowAdaptor — the actual native/fallback decision is made by WriteFilesExecTransformer.doValidateInternal, which already rejects constant complex types. When validation fails, WriteFilesExec stays vanilla and FakeRowAdaptor is a harmless pass-through.

Also, #11787 (already merged) restricts NativeWritePostRule to Spark 3.3 only, so this code path is no longer reachable on 3.4+.


@Zouxxyy Zouxxyy changed the title [VL] Support native Parquet write for complex types (Struct/Array/Map) [VL] Enable native Parquet write for complex types (Struct/Array/Map) Mar 20, 2026

@Zouxxyy
Contributor Author

Zouxxyy commented Mar 20, 2026

CC @jackylee-ch @jinchengchenghh @zhztheplayer for a look, thanks.

For reference, the Iceberg write path does not restrict complex types (Struct/Array/Map) at the format level either.

enableSuite[GlutenVariantSuite]
  // TODO: Velox parquet writer marks all struct fields as OPTIONAL (nullable),
  // but Spark's variant type requires REQUIRED fields. Needs Velox-side fix.
  .exclude("SPARK-47546: invalid variant binary")
Contributor

Please add it to #11088

Contributor Author

It appears I don't have permission to edit it. Perhaps I can update it in the future; I actually have experience working with variants.

Contributor

Could you add a sub-issue to it?

Contributor Author

Added #11803

Contributor

Thanks! LGTM

Contributor

@jinchengchenghh jinchengchenghh left a comment

Thanks!

Contributor

@jackylee-ch jackylee-ch left a comment

Basically looks good to me. Left a few comments.

field.dataType match {
  case _: StructType => Some("StructType")
  case _: ArrayType => Some("ArrayType")
  case _: MapType => Some("MapType")
Contributor

Do we have any native test cases for HiveFileFormat's StructType, ArrayType, and MapType?

Contributor Author

Thanks for the review; added the test.



Labels

CORE works for Gluten Core VELOX

4 participants